# Mathematical Contributions to the Theory of Evolution. On a Form of Spurious Correlation Which May Arise When Indices Are Used in the Measurement of Organs


was always less than the standard error when only $x_2$ was taken into account, unless $\rho_{13} = 0$. We may now prove the similar theorem that when we use three variables, $x_2$, $x_3$, $x_4$, on which to base the estimate, the standard error will be again decreased, unless $\rho_{14} = 0$. The condition that $S(v^2)$, in our present case, shall be less than $S(v^2)$ in the last, is, in fact,

$$\left\{ \begin{aligned} & r_{12}^2 + r_{13}^2 + r_{14}^2 - r_{12}^2 r_{34}^2 - r_{13}^2 r_{24}^2 - r_{14}^2 r_{23}^2 \\ & - 2\,(r_{12} r_{13} r_{23} + r_{12} r_{14} r_{24} + r_{13} r_{14} r_{34}) \\ & + 2\,(r_{12} r_{13} r_{24} r_{34} + r_{12} r_{14} r_{23} r_{34} + r_{13} r_{14} r_{23} r_{24}) \end{aligned} \right\} (1 - r_{23}^2) > (r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23})(1 - r_{23}^2 - r_{24}^2 - r_{34}^2 + 2 r_{23} r_{24} r_{34}).$$

This may be finally reduced to

$$(r_{14} - r_{13} r_{34} - r_{12} r_{24} - r_{14} r_{23}^2 + r_{13} r_{23} r_{24} + r_{12} r_{23} r_{34})^2 > 0,$$

that is, $\rho_{14}^2 > 0$.

The treatment of the general case of $n$ variables, so far as regards obtaining the regressions, is obvious, and it is unnecessary to give it at length.

We can now see that the use of normal regression formulae is quite legitimate in all cases, so long as the necessary limitations of interpretation are recognised. Bravais' $r$ always remains a coefficient of correlation. These results I must plead as justification for my use of normal formulae in two cases* where the correlation was markedly non-normal.

"Mathematical Contributions to the Theory of Evolution. On a Form of Spurious Correlation which may arise when Indices are used in the Measurement of Organs." By Karl Pearson, F.R.S., University College, London. Received December 29, 1896. Read February 18, 1897.
* 'Economic Journal,' Dec., 1895, and Dec., 1896, "On the Correlation of Total Pauperism with Proportion of Out-relief."

(1) If the ratio of two absolute measurements on the same or different organs be taken, it is convenient to term this ratio an index.

If $u = f_1(x, y)$ and $v = f_2(z, y)$ be two functions of the three variables $x, y, z$, and these variables be selected at random so that there exists no correlation between $x, y$; $y, z$; or $z, x$, there will still be found to exist correlation between $u$ and $v$. Thus a real danger arises when a statistical biologist attributes the correlation between two functions like $u$ and $v$ to organic relationship. The particular case that is likely to occur is when $u$ and $v$ are indices with the same denominator, for the correlation of indices seems at first sight a very plausible measure of organic correlation.

The difficulty and danger which arise from the use of indices was brought home to me recently in an endeavour to deal with a considerable series of personal equation data. In this case it was convenient to divide the errors made by three observers in estimating a variable quantity by the actual value of the quantity. As a result there appeared a high degree of correlation between three series of absolutely independent judgments. It was some time before I realised that this correlation had nothing to do with the manner of judging, but was a special case of the above principle due to the use of indices.

A further illustration is of the following kind. Select three numbers within certain ranges at random, say $x, y, z$; these will be pair and pair uncorrelated. Form the proper fractions $x/y$ and $z/y$ for each triplet, and correlation will be found between these indices.

The application of this idea to biology seems of considerable importance. For example, a quantity of bones are taken from an ossuarium, and are put together in groups, which are asserted to be those of individual skeletons.
To test this a biologist takes the triplet femur, tibia, humerus, and seeks the correlation between the indices femur/humerus and tibia/humerus. He might reasonably conclude that this correlation marked organic relationship, and believe that the bones had really been put together substantially in their individual grouping. As a matter of fact, since the coefficients of variation for femur, tibia, and humerus are approximately equal, there would be, as we shall see later, a correlation of about 0.4 to 0.5 between these indices had the bones been sorted absolutely at random. I term this a spurious organic correlation, or simply a spurious correlation. I understand by this phrase the amount of correlation which would still exist between the indices, were the absolute lengths on which they depend distributed at random.

It has hitherto been usual to measure the organic correlation of the organs of shrimps, prawns, crabs, &c., by the correlation of indices in which the denominator represents the total body length or total carapace length. Now suppose a table formed of the absolute lengths and the indices of, say, some thousand individuals. Let an "imp" (allied to the Maxwellian demon) redistribute the indices at random; they would then exhibit no correlation. If the corresponding absolute lengths followed along with the indices in the redistribution, they also would exhibit no correlation. Now let us suppose the indices not to have been calculated, but the imp to redistribute the absolute lengths; these would now exhibit no organic correlation, but the indices calculated from this random distribution would have a correlation nearly as high, if not in some cases higher than before. The biologist would be not unlikely to argue that the index correlation of the imp-assorted, but probably, from the vital standpoint, impossible beings was "organic."
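The imp's redistribution of absolute lengths can be imitated numerically. The following is only a minimal sketch under stated assumptions: the three bone lengths are drawn independently from normal distributions with hypothetical means and a common coefficient of variation of 5 per cent. (none of these values come from the memoir), and the function names `pearson_r` and `simulate_random_skeletons` are illustrative.

```python
import math
import random


def pearson_r(a, b):
    """Product-moment correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def simulate_random_skeletons(n=20000, cv=0.05, seed=1897):
    """Assemble n skeletons from independently drawn bones.

    The means (in centimetres) and the common coefficient of variation
    are hypothetical values chosen only for illustration.
    Returns (correlation of femur with tibia,
             correlation of femur/humerus with tibia/humerus).
    """
    rng = random.Random(seed)
    femur = [rng.gauss(40.0, 40.0 * cv) for _ in range(n)]
    tibia = [rng.gauss(32.0, 32.0 * cv) for _ in range(n)]
    humerus = [rng.gauss(30.0, 30.0 * cv) for _ in range(n)]

    # Absolute lengths: no correlation beyond sampling noise.
    r_absolute = pearson_r(femur, tibia)

    # Indices sharing the humerus as denominator: correlated nonetheless.
    femur_index = [f / h for f, h in zip(femur, humerus)]
    tibia_index = [t / h for t, h in zip(tibia, humerus)]
    r_index = pearson_r(femur_index, tibia_index)
    return r_absolute, r_index


if __name__ == "__main__":
    r_absolute, r_index = simulate_random_skeletons()
    print(f"absolute lengths: r = {r_absolute:+.3f}")
    print(f"indices:          r = {r_index:+.3f}")
```

With equal coefficients of variation the index correlation should settle near one half while the correlation of the absolute lengths stays near zero, which is the behaviour the text attributes to the randomly sorted bones.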
As a last illustration, suppose 1000 skeletons obtained by distributing component bones at random. Between none of their bones will these individuals exhibit correlation. Wire the spurious skeletons together and photograph them all, so that their stature in the photographs is the same; the series of photographs, if measured, will show correlation between their parts. It seems to me that the biologist who reduces the parts of an animal to fractions of some one length measured upon it is dealing with a series very much like these photographs. A part of the correlation he discovers between organs is undoubtedly organic, but another part is solely due to the nature of his arithmetic, and as a measure of organic relationship is spurious.

Returning to our problem of the randomly distributed bones, let us suppose the indices femur/humerus and tibia/humerus to have a correlation of 0.45. Now suppose successively 1, 2, 3, 4, &c., per cent. of the bones are assorted in their true groupings; then begins the true organic correlating of the bones. It starts from 0.45, and will alter gradually until 100 per cent. of the bones are truly grouped. The final value may be greater or less than 0.45, but it would seem that 0.45 is a more correct point to measure the organic correlation from than zero. At any rate it appears fairly certain that if a biologist recognised that a perfectly random selection of organs would still lead to a correlation of organ-indices, he would be unlikely to accept index-correlation as a fair measure of the relative intensity of correlation between organs.

I shall accordingly define spurious organic correlation as the correlation which will be found between indices, when the absolute values of the organs have been selected purely at random. In estimating relative correlation by the hitherto usual measurement of indices, it seems to me that a statement of the amount of spurious correlation ought always to be made.

(2) Proposition I.
To find the mean of an index in terms of the means, coefficients of variation, and coefficient of correlation of the two absolute measurements.*

Let $x_1, x_2, x_3, x_4$ be the absolute sizes of any four correlated organs; $m_1, m_2, m_3, m_4$ their mean values; $\sigma_1, \sigma_2, \sigma_3, \sigma_4$ their standard deviations; $v_1, v_2, v_3, v_4$ their coefficients of variation, i.e., $\sigma_1/m_1$, $\sigma_2/m_2$, $\sigma_3/m_3$, $\sigma_4/m_4$ respectively; $r_{12}, r_{23}, r_{34}, r_{41}, r_{24}, r_{13}$ the six coefficients of correlation; $\epsilon_1, \epsilon_2, \epsilon_3, \epsilon_4$ the deviations of the four organs from their means, i.e., $x_1 = m_1 + \epsilon_1$, $x_2 = m_2 + \epsilon_2$, $x_3 = m_3 + \epsilon_3$, $x_4 = m_4 + \epsilon_4$; $i_{13}$ the mean value of the index $x_1/x_3$ and $i_{24}$ the mean value of $x_2/x_4$; $\Sigma_{13}, \Sigma_{24}$ the standard deviations of the indices $x_1/x_3$ and $x_2/x_4$ respectively; and $n$ the total number of groups of organs. We shall suppose the ratios of the deviations to the mean absolute values of the organs are so small that their cubes may be neglected.

* In all that follows, unless otherwise stated, the correlation may be of any kind whatever, i.e., the frequencies are not supposed to follow the Gaussian or normal law of error.

Then

$$i_{13} = \frac{1}{n} S\!\left(\frac{x_1}{x_3}\right) = \frac{m_1}{n m_3} S\!\left\{\left(1 + \frac{\epsilon_1}{m_1}\right)\left(1 + \frac{\epsilon_3}{m_3}\right)^{-1}\right\} = \frac{m_1}{n m_3}\left(n + \frac{S(\epsilon_1)}{m_1} - \frac{S(\epsilon_3)}{m_3} - \frac{S(\epsilon_1 \epsilon_3)}{m_1 m_3} + \frac{S(\epsilon_3^2)}{m_3^2}\right),$$

if we neglect quantities of the third order in $\epsilon/m$. But $S(\epsilon_1) = S(\epsilon_3) = 0$, $S(\epsilon_1 \epsilon_3) = n r_{13} \sigma_1 \sigma_3$, and $S(\epsilon_3^2) = n \sigma_3^2$. Hence:

$$i_{13} = \frac{m_1}{m_3}\left(1 + v_3^2 - r_{13} v_1 v_3\right) \qquad \text{(i)}.$$

Similarly

$$i_{24} = \frac{m_2}{m_4}\left(1 + v_4^2 - r_{24} v_2 v_4\right) \qquad \text{(ii)}.$$

Thus we see that the mean of an index is not the ratio of the means of the corresponding absolute measurements, but differs by a quantity depending on the correlation and variation coefficients of the absolute measurements.

(3) Proposition II. To find the standard deviation of an index in terms of the coefficients of variation, and coefficient of correlation of the two absolute measurements.

$$n \Sigma_{13}^2 = S\!\left(\frac{x_1}{x_3} - i_{13}\right)^2 = \frac{m_1^2}{m_3^2}\, S\!\left\{\left(\frac{\epsilon_1}{m_1} - \frac{\epsilon_3}{m_3}\right) + \text{square terms}\right\}^2 = i_{13}^2\left(n v_1^2 + n v_3^2 - 2 n r_{13} v_1 v_3\right),$$

if we neglect cubic terms.

$$\therefore \quad \Sigma_{13} = i_{13} \sqrt{v_1^2 + v_3^2 - 2 r_{13} v_1 v_3} \qquad \text{(iii)}.$$

(4) Proposition III. To find the coefficient of correlation of two indices in terms of the coefficients of correlation of the four absolute measurements and their coefficients of variation.

Let $x_1/x_3$ and $x_2/x_4$ be the two indices, and $\rho$ their coefficient of correlation. Then

$$n \rho\, \Sigma_{13} \Sigma_{24} = S\!\left(\frac{x_1}{x_3} - i_{13}\right)\!\left(\frac{x_2}{x_4} - i_{24}\right) = \frac{m_1 m_2}{m_3 m_4}\, S\!\left\{\left(\frac{\epsilon_1}{m_1} - \frac{\epsilon_3}{m_3}\right)\!\left(\frac{\epsilon_2}{m_2} - \frac{\epsilon_4}{m_4}\right)\right\} = n\, i_{13} i_{24} \left(r_{12} v_1 v_2 - r_{14} v_1 v_4 - r_{23} v_2 v_3 + r_{34} v_3 v_4\right),$$

if we neglect terms of the cubic order. Hence, finally,

$$\rho = \frac{r_{12} v_1 v_2 - r_{14} v_1 v_4 - r_{23} v_2 v_3 + r_{34} v_3 v_4}{\sqrt{v_1^2 + v_3^2 - 2 r_{13} v_1 v_3}\, \sqrt{v_2^2 + v_4^2 - 2 r_{24} v_2 v_4}} \qquad \text{(iv)}.$$

(5) Thus we have expressed $\rho$ in terms of the four coefficients of correlation and the four coefficients of variation of the absolute measurements which form the indices. We may draw the following conclusions:

(i.) The correlation between two indices will always vanish when the four absolute measurements forming the indices are quite uncorrelated.

(ii.) If two of the organs are perfectly correlated, let us say made identical: for example, the third and fourth, so that $r_{34} = 1$ and $v_3 = v_4$, we find

$$\rho = \frac{r_{12} v_1 v_2 - r_{13} v_1 v_3 - r_{23} v_2 v_3 + v_3^2}{\sqrt{v_1^2 + v_3^2 - 2 r_{13} v_1 v_3}\, \sqrt{v_2^2 + v_3^2 - 2 r_{23} v_2 v_3}} \qquad \text{(v)}.$$

This is the coefficient of correlation between two indices with the same denominator ($x_1/x_3$ and $x_2/x_3$).

The value of $\rho$ in (v) does not vanish if the remaining organs be quite uncorrelated, i.e., $r_{12} = r_{13} = r_{23} = 0$. In this case

$$\rho_0 = \frac{v_3^2}{\sqrt{v_1^2 + v_3^2}\, \sqrt{v_2^2 + v_3^2}} \qquad \text{(vi)}.$$

This is the measure of the spurious correlation. For the special case in which the coefficients of variation are all the same, $\rho_0 = 0.5$. When the absolute sizes of organs are very feebly correlated, then in most cases there will be a considerable correlation of indices.

Example (a).
Suppose three organs, $x_1$, $x_2$, and $x_3$, to have sensibly equal coefficients of variation, and that the correlation of $x_1$ and $x_2$ is $r_{12} = r$, and that of $x_1$ and $x_3$, as well as of $x_2$ and $x_3$, is $r'$. Then:

$$\rho = 0.5 \times \frac{1 + r - 2r'}{1 - r'} = 0.5 + 0.5\, \frac{r - r'}{1 - r'}.$$

This formula illustrates well in a specially simple case how the correlation in the indices diverges from the spurious value 0.5, as we alter $r$ and $r'$ from zero, i.e., as we introduce organic correlation. According as $r$, the correlation of the numerators, is greater or less than $r'$, the correlation of the numerator with the denominator, the actual index correlation can be greater or less than the spurious value.

Example (b). If $z_1, z_2$ be the indices, then in the case of normal correlation the contour lines of the correlation surface for the indices are given by

$$\frac{z_1^2}{\Sigma_1^2 (1 - \rho^2)} - \frac{2 \rho\, z_1 z_2}{\Sigma_1 \Sigma_2 (1 - \rho^2)} + \frac{z_2^2}{\Sigma_2^2 (1 - \rho^2)} = \text{constant},$$

where $\Sigma_1$, $\Sigma_2$, and $\rho$ are given by (iii) and (iv) above. The contour lines of a surface of spurious index correlation follow on substituting (iii) and (vi), and may be written

$$\left(v_2^2 + v_3^2\right) \frac{z_1^2}{i_{13}^2} - 2 v_3^2\, \frac{z_1 z_2}{i_{13} i_{23}} + \left(v_1^2 + v_3^2\right) \frac{z_2^2}{i_{23}^2} = \text{constant},$$

while the uncorrelated distribution of the numerators $x_1$ and $x_2$ is given by the contours

$$\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2} = \text{constant}.$$

We are thus able to mark the growth of the spurious correlation as we increase $v_3$ from zero; we see the axes of the ellipses diminishing and their directions beginning to rotate.

Example (c). To find the spurious correlation between the two chief cephalic indices.

I have calculated the following results from the measurements made on 100 "Altbayerisch" ♂ skulls, by Professor J. Ranke. See his 'Anthropologie der Bayern,' Bd. i, Kapitel v, S. 194.

| Measurement* | Mean | Standard deviation | Coefficient of variation |
|---|---|---|---|
| Breadth of skull | $m_1 = 150.47$ | $\sigma_1 = 5.8488$ | $v_1 = 3.8871$ |
| Height of skull | $m_2 = 133.78$ | $\sigma_2 = 4.6761$ | $v_2 = 3.4954$ |
| Length of skull | $m_3 = 180.58$ | $\sigma_3 = 5.8441$ | $v_3 = 3.2363$ |
| Cephalic index, B/L | $i_{13} = 83.41$ | $\Sigma_{13} = 3.5794$ | $V_{13} = 4.2913$ |
| Cephalic index, H/L | $i_{23} = 74.23$ | $\Sigma_{23} = 3.6305$ | $V_{23} = 4.8909$ |
| Cephalic index, H/B | $i_{21} = 89.12$ | $\Sigma_{21} = 4.1752$ | $V_{21} = 4.6849$ |

* All the absolute measures given are in millimetres, and the coefficients of variation are percentage variations, i.e., they must be divided by 100 before being used in formulae (i), (ii), and (iii).

The coefficients of correlation may at once be deduced:

Breadth and length: $r_{13} = (v_1^2 + v_3^2 - V_{13}^2)/(2 v_1 v_3) = 0.2849$.

Height and length: $r_{23} = (v_2^2 + v_3^2 - V_{23}^2)/(2 v_2 v_3) = -0.0543$.

Height and breadth: $r_{12} = (v_2^2 + v_1^2 - V_{21}^2)/(2 v_1 v_2) = 0.1243$.

This is the first table, so far as I am aware, that has been published of the variation and correlation of the three chief cephalic lengths.† It shows us that there is not at all a close correlation between these chief dimensions of the skull, and that a small compensating factor for size is to be sought in the correlation of height and length, i.e., while a broad skull is probably a long skull and also a high skull, a high skull will probably be a short skull, and a low skull a long skull.

Without substituting the values of $v_1, v_2, v_3, r_{12}, r_{13}, r_{23}$ in (v), we can find $\rho$, or the correlation between the breadth/length and height/length indices, from:

$$\rho = \frac{V_{13}^2 + V_{23}^2 - V_{12}^2}{2 V_{13} V_{23}}.$$

This follows at once from the general theorem given in my memoir on "Regression, Panmixia, and Heredity," 'Phil. Trans.,' vol. 187, A, p. 279, or by substitution of the above values of $r_{12}, r_{13}, r_{23}$ in (v). We find

$$\rho = 0.4857.$$

If we calculate from (vi) the correlation between the same cephalic indices on the hypothesis that their heights, breadths and lengths are distributed at random, i.e., that our "imp" has constructed a number of arbitrary and spurious skulls from Professor Ranke's measurements, we find:

$$\rho_0 = 0.4008.$$

It seems to me that a quite erroneous impression would be formed of the organic correlation of the human skull, did we judge it by the magnitude of the correlation coefficient (0.4857) for the two chief
cephalic indices, for no less than 0.4008 of this would remain, if we destroyed all organic relationship between the lengths on which these indices are based.

† I hope later to treat correlation in man with reference to race, sex, and organ, as I have treated variation.

Example (d). To find the spurious correlation between the indices femur/humerus and femur/tibia.

The following results have been calculated* from measurements made by Koganei on Aino skeletons. (See 'Mittheilungen aus der medicinischen Facultät der K. J. Universität, Tokio,' Bd. I. Tables.) I have kept the sexes apart although there are but few of each.

♂ Skeletons. Number = 40 to 44. Measurements in centimetres.

| Bone | Mean | Standard deviation | Coefficient of variation |
|---|---|---|---|
| Femur, F | $m_1 = 40.845$ | $\sigma_1 = 1.957$ | $v_1 = 4.792$ |
| Tibia, T | $m_2 = 31.740$ | $\sigma_2 = 1.577$ | $v_2 = 4.970$ |
| Humerus, H | $m_3 = 29.593$ | $\sigma_3 = 1.337$ | $v_3 = 4.517$ |

The following coefficients of correlation were calculated directly:

Femur and tibia: $r_{12} = 0.8266$.

Femur and humerus: $r_{13} = 0.8585$.

Tibia and humerus: $r_{23} = 0.7447$.

From these were deduced by the formulae of this paper†:

| Index | Mean | Standard deviation | Coefficient of variation |
|---|---|---|---|
| F/T | $i_{12} = 128.75$ | $\Sigma_{12} = 3.7075$ | $V_{12} = 2.8795$ |
| F/H | $i_{13} = 137.92$ | $\Sigma_{13} = 3.4084$ | $V_{13} = 2.4714$ |
| T/H | $i_{23} = 107.02$ | $\Sigma_{23} = 3.6675$ | $V_{23} = 3.4271$ |

Hence we find for the correlation of the indices F/H and T/H:

$$\rho = 0.5644.$$

But the spurious correlation, if the bones had been grouped at random, would have been

$$\rho_0 = 0.4557.$$

* I have to thank Miss Alice Lee for a considerable part of the arithmetic work of this example.

† The values for the indices are not in absolute agreement with those to be deduced from the lengths, for it was not always possible to use the same skeleton for femur and humerus as for tibia and humerus, i.e., sometimes one or other bone was missing.
For the same reason, the constants for the absolute lengths do not agree entirely with those given for Ainos in my paper on "Variation in Man and Woman" (in 'The Chances of Death and other Studies in Evolution,' vol. 1, p. 303), for the simple reason that I there used every available bone, and not every available pair, as here.

Tabulating the corresponding quantities for the other sex, we find:

♀ Skeletons. Number = 22 to 24. Measurements in centimetres.

| Bone | Mean | Standard deviation | Coefficient of variation |
|---|---|---|---|
| Femur, F | $m_1 = 38.075$ | $\sigma_1 = 1.494$ | $v_1 = 3.924$ |
| Tibia, T | $m_2 = 29.800$ | $\sigma_2 = 1.576$ | $v_2 = 5.289$ |
| Humerus, H | $m_3 = 27.565$ | $\sigma_3 = 1.109$ | $v_3 = 4.022$ |

Femur and tibia: $r_{12} = 0.8457$.

Femur and humerus: $r_{13} = 0.8922$.

Tibia and humerus: $r_{23} = 0.7277$.

| Index | Mean | Standard deviation | Coefficient of variation |
|---|---|---|---|
| F/T | $i_{12} = 127.90$ | $\Sigma_{12} = 3.8937$ | $V_{12} = 3.0444$ |
| F/H | $i_{13} = 138.37$ | $\Sigma_{13} = 2.6930$ | $V_{13} = 1.9462$ |
| T/H | $i_{23} = 108.36$ | $\Sigma_{23} = 4.1022$ | $V_{23} = 3.7857$ |

$$\rho = 0.6006, \qquad \rho_0 = 0.3904.$$

Hence we may conclude as follows:

(i) The absolute lengths of the long bones differ from those of the skull in being very closely correlated.

(ii) The use of indices for the long bones would appear to minimise, rather than, as in the case of the skull, to exaggerate this correlation.

(iii) If we measure, however, organic correlation of the indices by $\rho - \rho_0$, we shall find index correlation less than absolute length correlation for both long bones and skull, and in both cases the former comparatively small as compared with the latter.

(iv) The results for the 24 female skeletons, although based on but few data, serve on the whole to confirm the male results.*

(6)
From the above examples it will be seen that the method, which judges of the intensity of organic correlation by the reduction of all absolute measures to indices, the denominators of which are some one absolute measurement, is not free from obscurity; for this method would give the major portion of the observed index correlation had the parts of the animal been thrown together entirely at random, i.e., if there were no organic correlation at all. The following additional remarks may be of interest.

The results (iv)-(vi) show us that the correlation coefficients of indices are functions, not only of the correlation coefficients of absolute measurements, but also of the coefficients of variation of the latter measurements. Hence, unless the coefficients of variation be constant for local races, it is impossible that the coefficients of correlation can be constant for indices. In other words, the hypothesis of the constancy for local races of correlation, and that of the constancy for local races of variation, stand on exactly the same footing.

The conclusions of this paper, although applied to organic correlation, are equally valid so far as concerns the use of indices in judging the correlation of either physical or economic phenomena. It was, indeed, a difficulty arising from my discussion of personal judgments (a spurious correlation between the judgments of different observers) which first drew my attention to the matter.

* The fact that the male is more variable in height-sitting, in femur, and in tibia than the female, while she appears to be more variable than he is in stature, led me to prophesy, in my paper on "Variation in Man and Woman," that the female would be found to be more closely correlated in the bones forming stature than the male. This appears to be the case for the femur and tibia of Ainos.

Note, January 13, 1897.
The result described by Professor Pearson evidently affects the value of the correlation coefficients determined by me in Crangon and Carcinus ('Roy. Soc. Proc.,' vols. 51 and 54), because I have always expressed the size of the organs measured in terms of body length. In order to show the effect of this, I have lately performed, at Professor Pearson's suggestion, the following experiment: It happens that my measures of Plymouth shrimps are recorded in a book, in the order in which they were measured, and therefore at random as regards carapace length or other characters. I constructed from these records 420 "spurious" shrimps, in the following way: the total length of the first shrimp in the book was associated with the carapace length of the tenth shrimp and the "post-spinous length" of the twentieth, and so throughout. Evidently these three measures were associated at random, and we might expect that these spurious shrimps would show no organic correlation; but when the carapace lengths and "post-spinous lengths" of these spurious shrimps were divided by the body length, and the correlation between the resulting indices was determined, the value of $r$ was found to be 0.38, the value for real shrimps being 0.81, or the correlation due to the use of indices forms 47 per cent. of the observed value. W. F. R. Weldon.

"Note to the Memoir by Professor Karl Pearson, F.R.S., on Spurious Correlation." By Francis Galton, F.R.S. Received January 4, Read February 18, 1897.

I send this note to serve as a kind of appendix to the memoir of Professor K. Pearson, believing that it may be useful in enabling others to realise the genesis of spurious correlation. It is important though rather difficult to do so, because the results arrived at in the memoir, which are of serious interest to practical statisticians, have at first sight a somewhat paradoxical appearance.
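As an aid to realising that genesis, the offset construction of "spurious" shrimps described in Professor Weldon's note above can be imitated as follows. This is a hedged sketch only: the measurements are synthetic (hypothetical means and scatter, not Weldon's records), the wrap-around at the end of the book is an assumption, since the note says only "and so throughout," and the names `spurious_shrimps` and `demo` are illustrative.

```python
import math
import random


def pearson_r(a, b):
    """Product-moment correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spurious_shrimps(total, carapace, post_spinous, k1=9, k2=19):
    """Pair the total length of record i with the carapace length of record
    i + k1 and the post-spinous length of record i + k2 (first with tenth
    and twentieth). Wrapping past the last record is an assumption."""
    n = len(total)
    return [
        (total[i], carapace[(i + k1) % n], post_spinous[(i + k2) % n])
        for i in range(n)
    ]


def demo(n=420, seed=54):
    """Build synthetic records in random order, reassociate them, and return
    (correlation of the reassociated absolute lengths,
     correlation of the two indices formed on total length)."""
    rng = random.Random(seed)
    # Hypothetical measurements: parts roughly proportional to total length.
    total = [rng.gauss(60.0, 5.0) for _ in range(n)]
    carapace = [0.25 * t * rng.gauss(1.0, 0.03) for t in total]
    post_spinous = [0.18 * t * rng.gauss(1.0, 0.03) for t in total]

    fake = spurious_shrimps(total, carapace, post_spinous)
    r_absolute = pearson_r([c for _, c, _ in fake], [p for _, _, p in fake])
    r_index = pearson_r([c / t for t, c, _ in fake], [p / t for t, _, p in fake])
    return r_absolute, r_index


if __name__ == "__main__":
    r_absolute, r_index = demo()
    print(f"reassociated lengths: r = {r_absolute:+.3f}")
    print(f"their indices:        r = {r_index:+.3f}")
```

The reassociated lengths come from different records and so should show no correlation beyond sampling error, whereas the two indices share a denominator and remain appreciably correlated.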